What Does the Test Say?!

A data analytics approach to predicting confirmed COVID-19 cases among suspected cases, based on the results of clinical tests.

By Nithesh Nayak K

As governments worldwide fast-track public-health safety initiatives, how effectively those initiatives are implemented also matters. Among them, the highest-priority step is rapid testing; limited testing facilities and minimal knowledge of the virus strain have become major contributors to the wild spread.

The author presents some findings, based on the results of laboratory tests commonly done for suspected COVID-19 cases during a visit to the Emergency Room (ER), to show how an analytical approach caters to the given situation.

The case study gives a high-level view of how a comprehensive data-driven decision-making approach helps in tackling this pandemic. A multi-layer perceptron model is used (a 5-layer deep neural network with 6, 12, 18, 24 and 30 neurons in each layer respectively), and it is explained how this turned out to be the best-fit model for the given dataset.

As we know, a neural network model is a black box when it comes to interpretability. To make the analytical decision-making approach fair, accountable and transparent, the LIME interpretability approach is implemented, which enables the model to assist decision making for health-care professionals and doctors during COVID-19 clinical test analysis.

Overall, the analytical approach helps COVID-19 frontline workers obtain accurate test results while relying on a minimal set of clinical tests to interpret the outcome. 
In [1]:
#Import Library
from io import StringIO
import pandas as pd
import numpy as np
from numpy import where

#for visualization
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import pyplot
from matplotlib import cm
from matplotlib.pyplot import figure
import seaborn as sns
sns.set()

#for pydot and graphviz
from IPython.display import Image  

#Metrics
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Machine Learning libraries
from sklearn.neural_network import MLPClassifier #For Multi-Layer Perceptron
from sklearn.model_selection import GridSearchCV #for GridSearch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder, StandardScaler,MinMaxScaler,LabelEncoder # very important for feature transformation

#Logistic Regression 
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

#Explainability
import lime
from lime import lime_tabular

#Ignore the warnings
import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_rows', 120) # Display up to 120 rows

DATA

  • Since clinical test data wasn't readily available, the author used an open Kaggle dataset containing the results of laboratory tests commonly collected for suspected COVID-19 cases during visits to the emergency room at the Hospital Israelita Albert Einstein in São Paulo, Brazil.
- As a first step, the data was explored and the necessary normalisation and filtering were performed.
In [2]:
#Import Data 
Data=pd.read_csv('Diagnosis_of_COVID-19_and_its_clinical_spectrum.csv')
# Shape of the dataset
print(" The data set has total of "+str(Data.shape[0])+" entries and "+str(Data.shape[1])+" features")
 The data set has total of 5644 entries and 111 features
- Note: Here each feature represents a different clinical test; the same analogy is used throughout the analysis. 
In [3]:
Data.head() #How the input data looks like
Out[3]:
Patient ID Patient age quantile SARS-Cov-2 exam result Patient addmited to regular ward (1=yes, 0=no) Patient addmited to semi-intensive unit (1=yes, 0=no) Patient addmited to intensive care unit (1=yes, 0=no) Hematocrit Hemoglobin Platelets Mean platelet volume ... Hb saturation (arterial blood gases) pCO2 (arterial blood gas analysis) Base excess (arterial blood gas analysis) pH (arterial blood gas analysis) Total CO2 (arterial blood gas analysis) HCO3 (arterial blood gas analysis) pO2 (arterial blood gas analysis) Arteiral Fio2 Phosphor ctO2 (arterial blood gas analysis)
0 44477f75e8169d2 13 negative 0 0 0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 126e9dd13932f68 17 negative 0 0 0 0.236515 -0.02234 -0.517413 0.010677 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 a46b4402a0e5696 8 negative 0 0 0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 f7d619a94f97c45 5 negative 0 0 0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 d9e41465789c2b5 15 negative 0 0 0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 111 columns

In [4]:
#Understanding the data distribution
#Proportion of data
df2 = Data.copy()
df2 = df2.rename(columns={"Patient ID":"Patient_iD"})
proportion=(df2['SARS-Cov-2 exam result'].value_counts()/df2['SARS-Cov-2 exam result'].count())*100 # Ratio of patients tested positive and negative in the dataset
print(proportion)

# fig1, ax1 = plt.subplots()
# ax1.pie([proportion], labels=['Negative cases', 'Positive cases'], autopct='%1.1f%%', startangle=90)
# ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
# plt.show()
negative    90.113395
positive     9.886605
Name: SARS-Cov-2 exam result, dtype: float64
The class proportions above show that the majority of the data represents negative cases; only about 10% of the records are positive cases. 
In [5]:
print("Overall this dataset has "+str(round(100*df2.isna().to_numpy().sum()/(df2.shape[0]*Data.shape[1]),2)) + "% of missing values")
Overall this dataset has 88.06% of missing values
In [6]:
# Percentage of missing values for each feature - checking the data for missing values 
df_null_pct = df2.isna().mean().round(4) * 100 # Mean fraction of NaN values in each feature (multiplied by 100 to express as a percentage) 
df_null_pct.sort_values()
Out[6]:
Patient_iD                                                 0.00
Patient age quantile                                       0.00
SARS-Cov-2 exam result                                     0.00
Patient addmited to regular ward (1=yes, 0=no)             0.00
Patient addmited to semi-intensive unit (1=yes, 0=no)      0.00
Patient addmited to intensive care unit (1=yes, 0=no)      0.00
Influenza B                                               76.01
Respiratory Syncytial Virus                               76.01
Influenza A                                               76.01
Rhinovirus/Enterovirus                                    76.05
Inf A H1N1 2009                                           76.05
CoronavirusOC43                                           76.05
Coronavirus229E                                           76.05
Parainfluenza 4                                           76.05
Adenovirus                                                76.05
Chlamydophila pneumoniae                                  76.05
Parainfluenza 3                                           76.05
Coronavirus HKU1                                          76.05
CoronavirusNL63                                           76.05
Parainfluenza 1                                           76.05
Bordetella pertussis                                      76.05
Parainfluenza 2                                           76.05
Metapneumovirus                                           76.05
Influenza A, rapid test                                   85.47
Influenza B, rapid test                                   85.47
Hemoglobin                                                89.32
Hematocrit                                                89.32
Red blood cell distribution width (RDW)                   89.33
Platelets                                                 89.33
Mean corpuscular volume (MCV)                             89.33
Eosinophils                                               89.33
Mean corpuscular hemoglobin (MCH)                         89.33
Basophils                                                 89.33
Leukocytes                                                89.33
Mean corpuscular hemoglobin concentration (MCHC)          89.33
Lymphocytes                                               89.33
Red blood Cells                                           89.33
Monocytes                                                 89.35
Mean platelet volume                                      89.39
Neutrophils                                               90.91
Proteina C reativa mg/dL                                  91.03
Creatinine                                                92.49
Urea                                                      92.97
Potassium                                                 93.43
Sodium                                                    93.44
Strepto A                                                 94.12
Aspartate transaminase                                    96.00
Alanine transaminase                                      96.01
Serum Glucose                                             96.31
Total Bilirubin                                           96.78
Direct Bilirubin                                          96.78
Indirect Bilirubin                                        96.78
Gamma-glutamyltransferase                                 97.29
Alkaline phosphatase                                      97.45
HCO3 (venous blood gas analysis)                          97.59
pH (venous blood gas analysis)                            97.59
Total CO2 (venous blood gas analysis)                     97.59
Base excess (venous blood gas analysis)                   97.59
pO2 (venous blood gas analysis)                           97.59
pCO2 (venous blood gas analysis)                          97.59
Hb saturation (venous blood gas analysis)                 97.59
International normalized ratio (INR)                      97.64
Creatine phosphokinase (CPK)                              98.16
Lactic Dehydrogenase                                      98.21
Myeloblasts                                               98.28
Myelocytes                                                98.28
Metamyelocytes                                            98.28
Promyelocytes                                             98.28
Rods #                                                    98.28
Segmented                                                 98.28
Relationship (Patient/Normal)                             98.39
Urine - Crystals                                          98.76
Urine - Color                                             98.76
Urine - Yeasts                                            98.76
Urine - Red blood cells                                   98.76
Urine - Leukocytes                                        98.76
Urine - Density                                           98.76
Urine - Bile pigments                                     98.76
Urine - Hemoglobin                                        98.76
Urine - pH                                                98.76
Urine - Aspect                                            98.76
Urine - Urobilinogen                                      98.78
Urine - Granular cylinders                                98.78
Urine - Hyaline cylinders                                 98.81
Urine - Protein                                           98.94
Urine - Esterase                                          98.94
Urine - Ketone Bodies                                     98.99
Ionized calcium                                           99.11
Magnesium                                                 99.29
ctO2 (arterial blood gas analysis)                        99.52
Hb saturation (arterial blood gases)                      99.52
pH (arterial blood gas analysis)                          99.52
Arterial Lactic Acid                                      99.52
Total CO2 (arterial blood gas analysis)                   99.52
pCO2 (arterial blood gas analysis)                        99.52
HCO3 (arterial blood gas analysis)                        99.52
pO2 (arterial blood gas analysis)                         99.52
Base excess (arterial blood gas analysis)                 99.52
Ferritin                                                  99.59
Arteiral Fio2                                             99.65
Phosphor                                                  99.65
Albumin                                                   99.77
Lipase dosage                                             99.86
Vitamin B12                                               99.95
Urine - Nitrite                                           99.98
Fio2 (venous blood gas analysis)                          99.98
Partial thromboplastin time (PTT)                        100.00
Urine - Sugar                                            100.00
Mycoplasma pneumoniae                                    100.00
D-Dimer                                                  100.00
Prothrombin time (PT), Activity                          100.00
dtype: float64
The list above shows that many features have a very high proportion of missing values.

Next, the author takes steps to handle the missing data, select features, and obtain a better balance between positive and negative cases. As seen above, the input data is heavily skewed towards negative cases. 

Data Filtering

Each feature holds a different type of value; converting the values to appropriate data types and handling missing values are done in the section below. 
In [7]:
# Replacing categorical string names with values using masking 
mask = {'positive': 1, 'negative': 0, 'detected': 1, 'not_detected': 0, 'not_done': np.NaN, 'Não Realizado': np.NaN, 'absent': 0, 'present': 1, 'normal': 1,
        'light_yellow': 1, 'yellow': 2, 'citrus_yellow': 3, 'orange': 4, 'clear': 1, 'lightly_cloudy': 2, 'cloudy': 3, 'altered_coloring': 4, '<1000': 1000, 'Ausentes': 0, 'Urato Amorfo --+': 1,
        'Oxalato de Cálcio +++': 1, 'Oxalato de Cálcio -++': 1, 'Urato Amorfo +++': 1}
df3 = df2.copy()
df3 = df2.replace(mask)
df3.head()
Out[7]:
Patient_iD Patient age quantile SARS-Cov-2 exam result Patient addmited to regular ward (1=yes, 0=no) Patient addmited to semi-intensive unit (1=yes, 0=no) Patient addmited to intensive care unit (1=yes, 0=no) Hematocrit Hemoglobin Platelets Mean platelet volume ... Hb saturation (arterial blood gases) pCO2 (arterial blood gas analysis) Base excess (arterial blood gas analysis) pH (arterial blood gas analysis) Total CO2 (arterial blood gas analysis) HCO3 (arterial blood gas analysis) pO2 (arterial blood gas analysis) Arteiral Fio2 Phosphor ctO2 (arterial blood gas analysis)
0 44477f75e8169d2 13 0 0 0 0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 126e9dd13932f68 17 0 0 0 0 0.236515 -0.02234 -0.517413 0.010677 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 a46b4402a0e5696 8 0 0 0 0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 f7d619a94f97c45 5 0 0 0 0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 d9e41465789c2b5 15 0 0 0 0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 111 columns

  • Missing data points in the filtered dataset are handled next. The strategy varies by feature: for the influenza rapid tests, missing values are replaced by zero since the result is either positive or negative. For all other numerical (continuous) features, missing data is replaced with the feature mean, computed separately for negative and positive cases. The tests are grouped into related sets for better readability.
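The class-conditional mean imputation described above can be sketched as follows (a minimal example on toy data, not the study's dataset; the column names are made up):

```python
import numpy as np
import pandas as pd

# Toy frame: 'result' is the class label, 'hemoglobin' a numeric feature with gaps
df = pd.DataFrame({
    "result":     [1, 1, 1, 0, 0, 0],
    "hemoglobin": [0.5, np.nan, 0.7, -0.2, np.nan, -0.4],
})

# Fill each NaN with the mean of its own class (positive vs negative)
df["hemoglobin"] = (df.groupby("result")["hemoglobin"]
                      .transform(lambda s: s.fillna(s.mean())))

print(df["hemoglobin"].round(2).tolist())  # [0.5, 0.6, 0.7, -0.2, -0.3, -0.4]
```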
In [8]:
pcs_vars = {'respiratory': ['Influenza B', 'Respiratory Syncytial Virus', 'Influenza A',
                            'Metapneumovirus', 'Parainfluenza 1', 'Inf A H1N1 2009',
                            'Bordetella pertussis', 'Chlamydophila pneumoniae', 'Coronavirus229E',
                            'Parainfluenza 3', 'CoronavirusNL63','Parainfluenza 4',
                            'Rhinovirus/Enterovirus', 'CoronavirusOC43', 'Coronavirus HKU1'],
            'regular_blood': ['Monocytes','Hemoglobin', 'Hematocrit',
                              'Red blood cell distribution width (RDW)', 'Red blood Cells',
                              'Platelets', 'Eosinophils', 'Basophils', 'Leukocytes',
                              'Mean corpuscular hemoglobin (MCH)', 'Mean corpuscular volume (MCV)',
                              'Lymphocytes'],
            'influenza_rapid': ['Influenza B, rapid test', 'Influenza A, rapid test']}
In [9]:
X_df =df3[['Patient age quantile']+['SARS-Cov-2 exam result'] + pcs_vars['regular_blood']+ pcs_vars['influenza_rapid'] +pcs_vars['respiratory']] # Selected variables for training the model
In [10]:
#percentage of missing values in positive case
dataset_positive = X_df[X_df['SARS-Cov-2 exam result'] == 1]
total = dataset_positive.isnull().sum().sort_values(ascending=False)
percent = (dataset_positive.isnull().sum()/dataset_positive.isnull().count()).sort_values(ascending=False)
missing_data_positive = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data_positive.head()
Out[10]:
Total Percent
Influenza A, rapid test 496 0.888889
Influenza B, rapid test 496 0.888889
Monocytes 475 0.851254
Hemoglobin 475 0.851254
Hematocrit 475 0.851254
In [11]:
#percentage of missing values in negative case
dataset_negative = X_df[X_df['SARS-Cov-2 exam result'] == 0]
total = dataset_negative.isnull().sum().sort_values(ascending=False)
percent = (dataset_negative.isnull().sum()/dataset_negative.isnull().count()).sort_values(ascending=False)
missing_data_negative = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data_negative
Out[11]:
Total Percent
Monocytes 4568 0.898152
Red blood cell distribution width (RDW) 4567 0.897955
Red blood Cells 4567 0.897955
Platelets 4567 0.897955
Eosinophils 4567 0.897955
Basophils 4567 0.897955
Leukocytes 4567 0.897955
Mean corpuscular hemoglobin (MCH) 4567 0.897955
Mean corpuscular volume (MCV) 4567 0.897955
Lymphocytes 4567 0.897955
Hemoglobin 4566 0.897759
Hematocrit 4566 0.897759
Influenza A, rapid test 4328 0.850963
Influenza B, rapid test 4328 0.850963
CoronavirusOC43 3846 0.756193
Coronavirus HKU1 3846 0.756193
Metapneumovirus 3846 0.756193
Parainfluenza 1 3846 0.756193
Inf A H1N1 2009 3846 0.756193
Bordetella pertussis 3846 0.756193
Chlamydophila pneumoniae 3846 0.756193
Coronavirus229E 3846 0.756193
Parainfluenza 3 3846 0.756193
CoronavirusNL63 3846 0.756193
Parainfluenza 4 3846 0.756193
Rhinovirus/Enterovirus 3846 0.756193
Influenza B 3844 0.755800
Respiratory Syncytial Virus 3844 0.755800
Influenza A 3844 0.755800
SARS-Cov-2 exam result 0 0.000000
Patient age quantile 0 0.000000
In [12]:
#Drop features with more than 86% missing values among positive cases
columns_to_exclude = missing_data_positive.index[missing_data_positive['Percent']> 0.86].tolist() #features with more than 86% missing values
X_df.drop(columns=columns_to_exclude, inplace=True)
print(columns_to_exclude)
X_df
['Influenza A, rapid test', 'Influenza B, rapid test']
Out[12]:
Patient age quantile SARS-Cov-2 exam result Monocytes Hemoglobin Hematocrit Red blood cell distribution width (RDW) Red blood Cells Platelets Eosinophils Basophils ... Inf A H1N1 2009 Bordetella pertussis Chlamydophila pneumoniae Coronavirus229E Parainfluenza 3 CoronavirusNL63 Parainfluenza 4 Rhinovirus/Enterovirus CoronavirusOC43 Coronavirus HKU1
0 13 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 17 0 0.357547 -0.022340 0.236515 -0.625073 0.102004 -0.517413 1.482158 -0.223767 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 8 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 5 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 15 0 NaN NaN NaN NaN NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5639 3 1 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5640 17 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5641 4 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5642 10 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5643 19 1 0.567652 0.541564 0.694287 -0.182790 0.578024 -0.906829 -0.835508 -1.140144 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5644 rows × 29 columns

In [13]:
# Redefine dataset positive and negative

dataset_negative = X_df[X_df['SARS-Cov-2 exam result'] == 0]
dataset_positive = X_df[X_df['SARS-Cov-2 exam result'] == 1]
dataset_negative = dataset_negative.dropna(axis=0, thresh=5) #keep only rows with at least 5 non-NA values

DN=dataset_negative.fillna(dataset_negative.mean())
DP=dataset_positive.fillna(dataset_positive.mean())
In [14]:
#Concatenate positive and negative cases together 
X = pd.concat([DN, DP])
nof_positive_cases = len(dataset_positive.index)
nof_negative_cases = len(dataset_negative.index)
In [15]:
fig1, ax1 = plt.subplots()
ax1.pie([nof_positive_cases, nof_negative_cases], labels=['Positive cases', 'Negative cases'], autopct='%1.1f%%', startangle=90)#, colors=['#c0ffd5', '#ffc0cb'])
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
Out[15]:
(-1.1114239947538227,
 1.1175127996736691,
 -1.1161018569206111,
 1.1007667924020406)
In [16]:
X #Dataset after all the data-preprocessing 
Out[16]:
Patient age quantile SARS-Cov-2 exam result Monocytes Hemoglobin Hematocrit Red blood cell distribution width (RDW) Red blood Cells Platelets Eosinophils Basophils ... Inf A H1N1 2009 Bordetella pertussis Chlamydophila pneumoniae Coronavirus229E Parainfluenza 3 CoronavirusNL63 Parainfluenza 4 Rhinovirus/Enterovirus CoronavirusOC43 Coronavirus HKU1
1 17 0 0.357547 -0.022340 0.236515 -0.625073 0.102004 -0.517413 1.482158 -0.223767 ... 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 1.000000 0.0 0.0
4 15 0 -0.078990 -0.042984 -0.041147 0.015938 -0.048516 0.112880 0.077025 0.025191 ... 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 1.000000 0.0 0.0
8 1 0 0.068652 -0.774212 -1.571682 -0.978899 -0.850035 1.429667 1.018625 -0.223767 ... 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0
9 17 0 -0.078990 -0.042984 -0.041147 0.015938 -0.048516 0.112880 0.077025 0.025191 ... 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0
13 13 0 -0.078990 -0.042984 -0.041147 0.015938 -0.048516 0.112880 0.077025 0.025191 ... 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5632 16 1 0.492976 0.262254 0.248097 -0.099663 0.303373 -0.705840 -0.481638 -0.157522 ... 0.0 0.0 0.0 0.008929 0.0 0.026786 0.0 0.053571 0.0 0.0
5633 4 1 0.492976 0.262254 0.248097 -0.099663 0.303373 -0.705840 -0.481638 -0.157522 ... 0.0 0.0 0.0 0.008929 0.0 0.026786 0.0 0.053571 0.0 0.0
5634 15 1 0.492976 0.262254 0.248097 -0.099663 0.303373 -0.705840 -0.481638 -0.157522 ... 0.0 0.0 0.0 0.008929 0.0 0.026786 0.0 0.053571 0.0 0.0
5639 3 1 0.492976 0.262254 0.248097 -0.099663 0.303373 -0.705840 -0.481638 -0.157522 ... 0.0 0.0 0.0 0.008929 0.0 0.026786 0.0 0.053571 0.0 0.0
5643 19 1 0.567652 0.541564 0.694287 -0.182790 0.578024 -0.906829 -0.835508 -1.140144 ... 0.0 0.0 0.0 0.008929 0.0 0.026786 0.0 0.053571 0.0 0.0

2005 rows × 29 columns

Each feature correlates differently with the others, so let's see how feature correlation is distributed over the given dataset. 
In [17]:
corrmat = abs(X.corr())

# Correlation with output variable
cor_target = corrmat["SARS-Cov-2 exam result"]

# Selecting highly correlated features
relevant_features = cor_target[cor_target>0.05].index.tolist()

# f, ax = plt.subplots(figsize=(16, 8))
# sns.heatmap(abs(X[relevant_features].corr().iloc[0:1, :]), yticklabels=[relevant_features[0]], xticklabels=relevant_features, vmin =  0.0, square=True, annot=True, vmax=1.0, cmap='RdPu')
In [18]:
X_with_relevant_features = X[relevant_features]
y_with_relevant_features = X_with_relevant_features['SARS-Cov-2 exam result']
X_with_relevant_features.drop(columns=['SARS-Cov-2 exam result'], inplace=True)
XX=X_with_relevant_features.copy()

Analytical model - Neural Network Approach

  • The section above gave us an insight into how the dataset is structured. Given the information obtained during preprocessing, the dataset has many features and entries, along with significant evidence of missing data points. Taking these factors into account, the author feels that a neural network might be a good fit. To explore this approach, the author tried different multilayer perceptron architectures with a cross-validation method to find the best-fit model.
In [19]:
#Setting new x and y values with the new dataframe. 

# # One hot encoding is done for target variable
enc = OneHotEncoder()

Y = enc.fit_transform(y_with_relevant_features.to_numpy()[:, np.newaxis]).toarray() # reshape the Series to a column before encoding

rs = 41 #Random State
X_train, X_test, y_train, y_test = train_test_split(XX, Y, test_size = 0.30, random_state = rs) #Split the data for train and Test with a ratio of 70:30 
In [20]:
# Shape of the preprocessed dataset (including the target variable)
print(" The data set has total of "+str(X.shape[0])+" entries and "+str(X.shape[1])+" features")
 The data set has total of 2005 entries and 29 features
Since the dataset has close to 30 features, the author used an MLP architecture with 5 hidden layers of 6, 12, 18, 24 and 30 neurons (nodes) respectively, trained with the default learning rate of 0.001 and run for up to 400 epochs (iterations) 
In [21]:
#Default Neural Network - Multi-Layer Perceptron

modelNN = MLPClassifier(hidden_layer_sizes=(6,12,18,24,30),random_state=rs,
                         solver='adam',max_iter=400) #Running a default multi-layer perceptron with random state rs = 41
modelNN.fit(X_train, y_train)

print("Parameters used in building this model are:\n")
print(modelNN) # Information about the neural network architecture used

#Test and Train Accuracy
print("\nClassification accuracy on training and test datasets are:\n")
print("Train accuracy:", modelNN.score(X_train, y_train))
print("Test accuracy:", modelNN.score(X_test, y_test))

#Model Prediction
y_pred_modelNN = modelNN.predict(X_test)
print("\n Classification Report: \n")
# Print Classification Report 
print(classification_report(y_test, y_pred_modelNN))
Parameters used in building this model are:

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(6, 12, 18, 24, 30), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=400,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=41, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

Classification accuracy on training and test datasets are:

Train accuracy: 0.9793300071275838
Test accuracy: 0.9634551495016611

 Classification Report: 

              precision    recall  f1-score   support

           0       0.97      0.98      0.97       420
           1       0.95      0.92      0.94       182

   micro avg       0.96      0.96      0.96       602
   macro avg       0.96      0.95      0.96       602
weighted avg       0.96      0.96      0.96       602
 samples avg       0.96      0.96      0.96       602

In [22]:
#Variation of Cost over the iteration 
plt.ylabel('cost')
plt.xlabel('iterations')
plt.title("Learning rate = " + str(0.001))
plt.plot(modelNN.loss_curve_)
plt.show()
The above graph shows how the cost (error) decreases as the number of iterations increases. With the neural network 
parameters and architecture above, gradient descent converged to a (local) minimum, producing the best-fit model. 


Result of the best-fit model: 97.93% train accuracy and 96.34% test accuracy, within a total of 400 iterations at the default learning rate of 0.001

Understanding the Classification Report:

  • Precision: of all positive predictions, the fraction that are actually positive.

      Precision = True Positive / (True Positive + False Positive)
  • Recall: of all actual positive cases, the fraction correctly predicted positive.

      Recall = True Positive / (True Positive + False Negative)
  • F1 Score: the harmonic mean of precision and recall. F1-score = (2 × recall × precision) / (recall + precision)

  • Note: These definitions are given for the positive class (1); the same apply to the negative class (0)
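The three formulas above can be checked with a few lines of code (the counts below are hypothetical, chosen to be roughly consistent with the positive-class row of the report):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw prediction counts."""
    precision = tp / (tp + fp)            # TP / (TP + FP)
    recall = tp / (tp + fn)               # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for the positive class
p, r, f = precision_recall_f1(tp=168, fp=8, fn=14)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.95 0.92 0.94
```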

  • The author tried several hyperparameter and model-tuning approaches, such as Grid Search cross-validation (with k-fold cross-validation) and dimensionality reduction (Recursive Feature Elimination using Logistic Regression as the base elimination model).
  • Since none of these gave a significant improvement over the model above, they were not included in this case study. The main takeaways from those approaches were:
- Because the data is heavily skewed towards negative patients, with far fewer positive COVID-19 patients,
    grid search with k-fold CV could not make much of a difference.


- For dimensionality reduction, the feature importances were very close and 29 of the 30 features were selected,
    which brought no improvement in model learning or interpretation.
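For reference, a grid search of the kind mentioned above would look roughly like this (the parameter grid and the synthetic stand-in data are assumptions for illustration, not the author's actual settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Small synthetic stand-in for the clinical dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=41)

# Illustrative (assumed) hyperparameter grid
param_grid = {
    "hidden_layer_sizes": [(6, 12), (6, 12, 18)],
    "alpha": [1e-4, 1e-3],
}

# 3-fold CV scored on F1, which is more informative than accuracy on imbalanced data
search = GridSearchCV(MLPClassifier(max_iter=300, random_state=41),
                      param_grid, cv=3, scoring="f1")
search.fit(X_demo, y_demo)
print(search.best_params_)
```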


When we deploy this analytical model in a decision-making process like the one in this scenario, the neural network cannot explain its results. Since the model's output is meant to support health-care workers and doctors on the frontline of this pandemic, a black-box model, where you feed values in one end and a result comes out the other, would be of little use: any slight mistake in their decisions puts not only the patient but a large portion of the population at risk.

To resolve this issue and give health-care professionals better insight into the test report (and to address the ethical concerns a review board would raise about deploying this methodology in the health sector), the author augments the best-fit model with an interpretability feature.

Interpreting the output

The author used Local Interpretable Model-Agnostic Explanations (LIME) to provide an explanation for each of the model's predictions.
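At its core, LIME perturbs a single instance, queries the black-box model on the perturbed samples, and fits a proximity-weighted linear surrogate whose coefficients act as local feature importances. A conceptual sketch of that idea (using scikit-learn only, not the `lime` package API; all names and values here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(41)

def local_surrogate(predict_fn, instance, n_samples=500, kernel_width=0.75):
    """Fit a weighted linear model around one instance (LIME's core idea)."""
    # Perturb the instance with Gaussian noise
    perturbed = instance + rng.normal(scale=0.5, size=(n_samples, instance.size))
    preds = predict_fn(perturbed)                       # query the black box
    # Weight perturbed samples by proximity to the original instance
    dist = np.linalg.norm(perturbed - instance, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(perturbed, preds, sample_weight=weights)
    return surrogate.coef_                              # local feature importances

# Toy black box: the probability depends mostly on feature 0
black_box = lambda Z: 1.0 / (1.0 + np.exp(-(3.0 * Z[:, 0] + 0.2 * Z[:, 1])))
coefs = local_surrogate(black_box, np.zeros(2))
print(coefs)  # feature 0 should dominate locally
```

The real `lime_tabular` explainer adds feature discretization and per-class explanations on top of this basic scheme.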

In [23]:
# Confusion matrix - gives insight into how well the model predicted the output and the ratio of false positives and false negatives
groundtruth = enc.inverse_transform( y_test )  #Actual Test result
predictions = enc.inverse_transform(modelNN.predict( X_test )) #Predicted Test Results
mat = confusion_matrix(groundtruth, predictions) #Creating Confusion Matrix
print(mat.T)

#Creating the heatmap of the confusion matrix
sns.heatmap(mat.T, square=True, cbar=True, xticklabels=["Negative", "Positive"], \
            yticklabels=[" Negative", "Positive"], annot=True, cmap=cm.viridis)

#Graph parameters 
plt.xlabel('true label')
plt.ylabel('predicted label');
[[412  14]
 [  8 168]]
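From the counts printed above (after the transpose, rows are the predicted label and columns the true label), the standard screening metrics can be read off directly; a quick check using those numbers:

```python
# Counts from the transposed confusion matrix printed above:
# rows = predicted (Negative, Positive), columns = true (Negative, Positive)
tn, fn = 412, 14   # predicted Negative: 412 truly Negative, 14 truly Positive
fp, tp = 8, 168    # predicted Positive: 8 truly Negative, 168 truly Positive

total = tn + fn + fp + tp
accuracy    = (tp + tn) / total   # overall fraction correct
sensitivity = tp / (tp + fn)      # recall on truly Positive patients
specificity = tn / (tn + fp)      # recall on truly Negative patients
precision   = tp / (tp + fp)      # trust in a Positive prediction

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f}")
```

With these counts, sensitivity works out to roughly 0.92, meaning about 8% of truly positive patients are missed, which matters most in this screening setting.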
Let's have a look at how well the model predicts in comparison with the actual results,
In [24]:
#Comparing predicted test results with actual test results

pred = enc.inverse_transform(modelNN.predict( X_test ))
X_Converted=XX.to_numpy()
display_Limit = 0
for patient_indx in range(0, len(X)):

    patients_feat = X_Converted[patient_indx,:]
    patients_true_pred = enc.inverse_transform(np.expand_dims(Y[patient_indx,:], 0))[0][0]

    # prediction
    pred = modelNN.predict(np.expand_dims(patients_feat, 0))
    pred = enc.inverse_transform( pred )[0][0]
    
    #For display purposes only the first few patients are shown; commenting out the if statement will print all patient information
    if display_Limit <= 10:
        print("Patient Number : %d \t Patient id : %s \t Predicted: %s \t True Diagnosis: %s\n" 
          %(patient_indx,df3['Patient_iD'][patient_indx], "Positive" if pred else "Negative", 
            "Positive" if patients_true_pred else "Negative"))
        display_Limit +=1
Patient Number : 0 	 Patient id : 44477f75e8169d2 	 Predicted: Negative 	 True Diagnosis: Negative

Patient Number : 1 	 Patient id : 126e9dd13932f68 	 Predicted: Negative 	 True Diagnosis: Negative

Patient Number : 2 	 Patient id : a46b4402a0e5696 	 Predicted: Negative 	 True Diagnosis: Negative

Patient Number : 3 	 Patient id : f7d619a94f97c45 	 Predicted: Negative 	 True Diagnosis: Negative

Patient Number : 4 	 Patient id : d9e41465789c2b5 	 Predicted: Negative 	 True Diagnosis: Negative

Patient Number : 5 	 Patient id : 75f16746216c4d1 	 Predicted: Negative 	 True Diagnosis: Negative

Patient Number : 6 	 Patient id : 2a2245e360808d7 	 Predicted: Negative 	 True Diagnosis: Negative

Patient Number : 7 	 Patient id : 509197ec73f1400 	 Predicted: Negative 	 True Diagnosis: Negative

Patient Number : 8 	 Patient id : 8bb9d64f0215244 	 Predicted: Negative 	 True Diagnosis: Negative

Patient Number : 9 	 Patient id : 5f1ed301375586c 	 Predicted: Negative 	 True Diagnosis: Negative

Patient Number : 10 	 Patient id : d720464cc322b6f 	 Predicted: Negative 	 True Diagnosis: Negative

  • Let's consider the four types of results we observe from the model, which are very crucial for the doctors and healthcare professionals: False Positive, False Negative, True Negative and True Positive.

    • Let's pick a patient for each of these conditions from the test dataset,
    - Patient Number 557 - predicted Positive, true value Negative - False Positive 


    - Patient Number 1467 - predicted Negative, true value Positive - False Negative 


    - Patient Number 1368 - predicted Negative, true value Negative - True Negative 


    - Patient Number 1470 - predicted Positive, true value Positive - True Positive 
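Rather than picking these patients by hand, the four outcome types can be located programmatically by scanning the label pairs. A self-contained sketch with toy labels; in the notebook, the `groundtruth` and `predictions` arrays from cell In [23] (mapped to 0/1) would take the place of `y_true` / `y_pred`:

```python
import numpy as np

def first_example_of_each_outcome(y_true, y_pred):
    """Return the first index of each (prediction, truth) outcome type,
    or None when that outcome never occurs. 1 = Positive, 0 = Negative."""
    outcomes = {
        "True Negative":  (0, 0),
        "False Positive": (1, 0),
        "False Negative": (0, 1),
        "True Positive":  (1, 1),
    }
    found = {}
    for name, (p, t) in outcomes.items():
        idx = np.where((y_pred == p) & (y_true == t))[0]
        found[name] = int(idx[0]) if idx.size else None
    return found

# Toy labels: index 0 is a TN, 1 a FP, 2 a FN, 3 a TP
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 1])
buckets = first_example_of_each_outcome(y_true, y_pred)
print(buckets)
# {'True Negative': 0, 'False Positive': 1, 'False Negative': 2, 'True Positive': 3}
```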
In [25]:
#Selecting all the features from the main dataset 
feature_names = XX.columns.to_list() 
len(feature_names)
feature_names
Out[25]:
['Patient age quantile',
 'Monocytes',
 'Hemoglobin',
 'Hematocrit',
 'Red blood cell distribution width (RDW)',
 'Red blood Cells',
 'Platelets',
 'Eosinophils',
 'Basophils',
 'Leukocytes',
 'Mean corpuscular hemoglobin (MCH)',
 'Mean corpuscular volume (MCV)',
 'Influenza B',
 'Respiratory Syncytial Virus',
 'Influenza A',
 'Metapneumovirus',
 'Inf A H1N1 2009',
 'Parainfluenza 3',
 'Parainfluenza 4',
 'Rhinovirus/Enterovirus',
 'Coronavirus HKU1']
In [26]:
# LIME has one explainer for all the models
# very important for feature transformation
X_train=X_train.to_numpy()
MAX_FEAT = 15 #Number of features to be considered for explainability
explainer = lime_tabular.LimeTabularExplainer(X_train, feature_names= feature_names, class_names=["Negative", "Positive"], verbose=False, mode='classification')

Patient Number: 557 - Predicted Positive, True value Negative - False Positive

In [27]:
# patient num 557 predicted Positive, true value Negative -False Positive

patient_indx = 557 # Patient Number 

# Features of given patient - Test results of given patient number 
patients_feat = X_Converted[patient_indx,:]

# Convert back from one hot encoding 1-D array, True(Actual) value of Test Result
patients_true_pred = enc.inverse_transform(np.expand_dims(Y[patient_indx,:], 0))[0][0] 

# prediction 
pred = modelNN.predict(np.expand_dims(patients_feat, 0)) # For given patient Number predicted values 
pred = enc.inverse_transform( pred )[0][0] # converting back from one hot encoding to single value, for predicted test result

#Printing patient number, predicted test result and actual test result
print("Patient Number : %d \t Patient id : %s \t Predicted: %s \t True Diagnosis: %s\n" 
      %(patient_indx,df3['Patient_iD'][patient_indx], "Positive" if pred else "Negative", "Positive" if patients_true_pred else "Negative"))
Patient Number : 557 	 Patient id : a83e83fca06c338 	 Predicted: Positive 	 True Diagnosis: Negative

In [28]:
# explain instance
exp_type01 = explainer.explain_instance(patients_feat, modelNN.predict_proba, num_features= MAX_FEAT )
# Show the predictions
exp_type01.show_in_notebook(show_table=True )

The above visualization tells us, starting from the left:

  • How confident the model is that the given patient is COVID-19 positive or negative (out of 1). Converting to a percentage, in this scenario we can say the model is 73% sure that the patient is Positive, with 26% doubt that it may be Negative.

    • For the given scenario this tells the healthcare professional how confident one can be in the result; since this case is a False Positive, the probability indicates there is a 26% chance of the patient actually being Negative.
  • The visualization in the middle shows how each feature contributes to the model prediction; the value just above each horizontal bar indicates its confidence value, and the features are ordered from highest to lowest confidence value (considering the top 10 features).

    • For the given scenario, after seeing the probability, the doctors may want to see which features the result depends on; based on their expertise and medical knowledge they can then decide what can be done for the patient.
  • The rightmost visualization shows the actual values of these top 10 features from the dataset.

    • For the given scenario, when the doctors are about to make a decision, these values give them a quick sneak peek of the test results.
  • Note: Studies have shown that COVID-19 commonly affects red blood cells (RBCs) (Amdahl, 2020), and from there the effect on RBCs echoes to other blood elements. So as a health professional, our first glance would be at the tests relating to RBCs (there are different tests primarily based on RBCs) and how much support (confidence value) the model gives to those RBC-related tests.
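The bars in the middle panel come from the weighted feature list that `exp_type01.as_list()` returns as `(rule, weight)` pairs; splitting that list by sign gives a quick textual summary of which tests push towards Positive and which towards Negative. A sketch using hypothetical pairs in the shape LIME returns (the rules and weights below are illustrative, not taken from the actual model):

```python
def split_by_support(weighted_rules):
    """Split LIME's (rule, weight) pairs into features supporting the
    Positive class (weight > 0) and the Negative class (weight < 0),
    each sorted by absolute weight, strongest first."""
    positive = [(r, w) for r, w in weighted_rules if w > 0]
    negative = [(r, w) for r, w in weighted_rules if w < 0]
    key = lambda rw: -abs(rw[1])
    return sorted(positive, key=key), sorted(negative, key=key)

# Hypothetical pairs, in the shape exp_type01.as_list() returns
rules = [
    ("Leukocytes <= -0.50", 0.21),
    ("Red blood Cells > 0.30", -0.08),
    ("Platelets <= -0.25", 0.12),
    ("Influenza B <= 0.00", -0.03),
]
supports_positive, supports_negative = split_by_support(rules)
print("Supports Positive:", supports_positive)
print("Supports Negative:", supports_negative)
```

A summary like this could be printed alongside the plots, so a clinician scanning many reports sees at a glance which tests carry the explanation.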

Patient Number: 557 - Test result Insight:

So to summarise this situation: looking at the prediction probability and the confidence values, since the confidence of the patient being Positive is not significantly high on the feature side either, doctors/health professionals may have to order an extra, more specific test to confirm whether the patient is actually Negative, especially since one of the RBC tests supports the COVID-19 result being Negative. The confidence values of the features, together with their actual values, support healthcare professionals in taking the necessary steps, and also make the decision-making process faster.

A much more detailed split of the confidence values is given below, to get the result at a higher resolution.
In [29]:
#print(exp_type01.as_list()) #Print the full list of top features and their effects
explanation_plot = exp_type01.as_pyplot_figure() #Plot showing variation in feature importance

Patient Number: 1467 - Predicted Negative, True value Positive - False Negative

In [30]:
### patient num 1467  predicted Negative. True value Positive - False Negative

patient_indx = 1467 # Patient Number 

# Features of given patient - Test results of given patient number 
patients_feat = X_Converted[patient_indx,:]

# Convert back from one hot encoding 1-D array, True(Actual) value of Test Result
patients_true_pred = enc.inverse_transform(np.expand_dims(Y[patient_indx,:], 0))[0][0] 

# prediction
pred = modelNN.predict(np.expand_dims(patients_feat, 0)) # For given patient Number predicted values 
pred = enc.inverse_transform( pred )[0][0] # converting back from one hot encoding to single value, for predicted test result

#Printing patient number, predicted test result and actual test result
print("Patient Number : %d \t Patient id : %s \t Predicted: %s \t True Diagnosis: %s\n" 
      %(patient_indx,df3['Patient_iD'][patient_indx], "Positive" if pred else "Negative", "Positive" if patients_true_pred else "Negative"))
Patient Number : 1467 	 Patient id : 2be1930c8e49989 	 Predicted: Negative 	 True Diagnosis: Positive

In [31]:
# explain instance
exp_type02 = explainer.explain_instance(patients_feat, modelNN.predict_proba, num_features= MAX_FEAT )
# Show the predictions
exp_type02.show_in_notebook(show_table=True )

Patient Number: 1467 - Test result Insight:

  • So to summarise this situation: the decision here is at high stakes, since a COVID-19 positive patient is marked as Negative by the model. From the confusion matrix, this happens for roughly 8% of the truly positive cases (14 out of 182). The confidence values show that very few features support the Negative outcome, while many features, including the RBC tests, support the patient being Positive.
  • We have seen many instances where a patient initially tests negative and is then confirmed positive on a retest (Guardian, 2020), so one has to be very careful with these scenarios. The confidence values of the features, together with their actual values, support healthcare professionals in taking the necessary steps, and also make the decision-making process faster.
  • A much more detailed split of the confidence values is given below, to get the result at a higher resolution.
In [32]:
#print(exp_type02.as_list()) #Print the full list of top features and their effects
explanation_plot = exp_type02.as_pyplot_figure() #Plot showing variation in feature importance

Patient Number: 1368 - Predicted Negative, True value Negative - True Negative

In [33]:
### patient num 1368  predicted Negative, true value Negative - True Negative

patient_indx = 1368 # Patient Number 

# Features of given patient - Test results of given patient number 
patients_feat = X_Converted[patient_indx,:]

# Convert back from one hot encoding 1-D array, True(Actual) value of Test Result
patients_true_pred = enc.inverse_transform(np.expand_dims(Y[patient_indx,:], 0))[0][0] 

# prediction
pred = modelNN.predict(np.expand_dims(patients_feat, 0)) # For given patient Number predicted values 
pred = enc.inverse_transform( pred )[0][0] # converting back from one hot encoding to single value, for predicted test result

#Printing patient number, predicted test result and actual test result
print("Patient Number : %d \t Patient id : %s \t Predicted: %s \t True Diagnosis: %s\n" 
      %(patient_indx,df3['Patient_iD'][patient_indx], "Positive" if pred else "Negative", "Positive" if patients_true_pred else "Negative"))
Patient Number : 1368 	 Patient id : c669452253b8699 	 Predicted: Negative 	 True Diagnosis: Negative

In [34]:
# explain instance
exp_type02 = explainer.explain_instance(patients_feat, modelNN.predict_proba, num_features= MAX_FEAT )
# Show the predictions
exp_type02.show_in_notebook(show_table=True )

Patient Number: 1368 - Test result Insight:

So to summarise this situation: looking at the prediction probability and the confidence values, doctors/health professionals can confirm that the patient is actually Negative, since the confidence values of the features, including the RBC tests, support the outcome being Negative. Effectively filtering out such patients helps lower the burden on healthcare professionals, and also speeds up the decision-making process.

A much more detailed split of the confidence values is given below, to get the result at a higher resolution.

Patient Number: 1470 - Predicted Positive, True value Positive - True Positive

In [35]:
# patient Number 1470 predicted Positive, true value Positive - True Positive

patient_indx = 1470 # Patient Number 

# Features of given patient - Test results of given patient number 
patients_feat = X_Converted[patient_indx,:]

# Convert back from one hot encoding 1-D array, True(Actual) value of Test Result
patients_true_pred = enc.inverse_transform(np.expand_dims(Y[patient_indx,:], 0))[0][0] 

# prediction
pred = modelNN.predict(np.expand_dims(patients_feat, 0)) # For given patient Number predicted values 
pred = enc.inverse_transform( pred )[0][0] # converting back from one hot encoding to single value, for predicted test result

#Printing patient number, predicted test result and actual test result
print("Patient Number : %d \t Patient id : %s \t Predicted: %s \t True Diagnosis: %s\n" 
      %(patient_indx,df3['Patient_iD'][patient_indx], "Positive" if pred else "Negative", "Positive" if patients_true_pred else "Negative"))
Patient Number : 1470 	 Patient id : 804cbf88e1b8333 	 Predicted: Positive 	 True Diagnosis: Positive

In [36]:
# explain instance
exp_type04 = explainer.explain_instance(patients_feat, modelNN.predict_proba, num_features= MAX_FEAT )
# Show the predictions
exp_type04.show_in_notebook(show_table=True )

Patient Number: 1470 - Test result Insight:

  • So to summarise this situation: looking at the prediction probability and the confidence values, doctors/health professionals can confirm that the patient is COVID-19 positive, given how strongly the features support the Positive outcome.
  • A much more detailed split of the confidence values is given below, to get the result at a higher resolution.
In [37]:
#print(exp_type04.as_list()) #Print the full list of top features and their effects
explanation_plot = exp_type04.as_pyplot_figure() #Plot showing variation in feature importance

Insights

  • The case study gives a complete data analytics approach, using advanced techniques, to cater for the given scenario.
  • There are some limitations on the modelling and data side of the case study: the dataset is unbalanced, with around 90% negative cases and only 10% positive cases. This limits the model's exposure to positive cases during learning, and even the interpretability model will be less accurate for False Negative patients, which is a serious matter of concern.
  • Nevertheless, this gives a helicopter view of how data-driven decision making helps in solving the crisis at hand. If better-distributed data were provided (especially on the characteristics of COVID-19 positive patients), the above-mentioned shortcoming could be reduced.
  • As the main business concern was to control the outbreak of this pandemic, this approach enables all the front-line workers and helpers to spread out and do accurate rapid testing.
  • It also helps increase the number of mobile test units: even though health workers are few, the clinical test kit is now available in a mobile version, and with this analytical approach it empowers other responsible frontline workers to administer the test and recognise the results accurately.
  • We have also seen that False Negative cases occur; in those scenarios one can remotely consult the health workers.
  • As rapid testing is one of the good strategies to strategically stop the spread, these analytics-enabled mobile test kits might have to be installed in places like airports, schools and offices, to stop a second wave of COVID-19 in countries like Australia, where the curve has successfully been flattened. (Stevens, 2020)

Data Source

References